45 research outputs found
Weak consistency of the 1-nearest neighbor measure with applications to missing data
When data is partially missing at random, imputation and importance weighting
are often used to estimate moments of the unobserved population. In this paper,
we study 1-nearest neighbor (1NN) importance weighting, which estimates moments
by replacing missing data with the complete data that is the nearest neighbor
in the non-missing covariate space. We define an empirical measure, the 1NN
measure, and show that it is weakly consistent for the measure of the missing
data. The main idea behind this result is that the 1NN measure is performing
inverse probability weighting in the limit. We study applications to missing
data and mitigating the impact of covariate shift in prediction tasks.
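As a rough illustration of the 1NN importance weighting idea described in this abstract, the sketch below weights each complete case by the fraction of missing cases for which it is the nearest neighbor in covariate space, then uses those weights to estimate a moment. The function names, the Euclidean metric, and the brute-force distance computation are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def one_nn_weights(x_complete, x_missing):
    """Weight each complete case by the share of missing cases it is nearest to.

    Hypothetical helper: assumes Euclidean distance on the shared covariates.
    """
    # Pairwise squared distances, shape (n_missing, n_complete).
    diffs = x_missing[:, None, :] - x_complete[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Index of the nearest complete case for every missing case.
    nearest = np.argmin(sq_dists, axis=1)
    counts = np.bincount(nearest, minlength=len(x_complete))
    return counts / len(x_missing)


def one_nn_moment(y_complete, weights, func=lambda y: y):
    """Estimate E[func(Y)] for the missing population as a weighted sum."""
    return float(np.sum(weights * func(y_complete)))


x_complete = np.array([[0.0], [1.0], [2.0]])
y_complete = np.array([10.0, 20.0, 30.0])
x_missing = np.array([[0.1], [1.9], [2.2], [2.4]])

weights = one_nn_weights(x_complete, x_missing)
estimate = one_nn_moment(y_complete, weights)
```

This is equivalent to imputing each missing response with the response of its nearest complete case and averaging the imputed values.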
Changepoint Detection over Graphs with the Spectral Scan Statistic
We consider the change-point detection problem of deciding, based on noisy
measurements, whether an unknown signal over a given graph is constant or is
instead piecewise constant over two connected induced subgraphs of relatively
low cut size. We analyze the corresponding generalized likelihood ratio (GLR)
statistic and relate it to the problem of finding a sparsest cut in a graph.
We develop a tractable relaxation of the GLR statistic based on the
combinatorial Laplacian of the graph, which we call the spectral scan
statistic, and analyze its properties. We show how its performance as a testing
procedure depends directly on the spectrum of the graph, and use this result to
explicitly derive its asymptotic properties on a few significant graph
topologies. Finally, we demonstrate both theoretically and by simulations that
the spectral scan statistic can outperform naive testing procedures based on
edge thresholding and χ² testing.
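The link between cut size and the combinatorial Laplacian mentioned in this abstract can be sketched as follows: for the 0/1 indicator vector x of a vertex subset, the quadratic form xᵀLx with L = D − A equals the number of edges leaving the subset. The code below only illustrates that identity on a small path graph; it is not the spectral scan statistic itself, and the function names are assumptions for this example.

```python
import numpy as np


def combinatorial_laplacian(adjacency):
    """Return L = D - A for a symmetric 0/1 adjacency matrix."""
    degrees = adjacency.sum(axis=1)
    return np.diag(degrees) - adjacency


def cut_size_via_laplacian(adjacency, subset):
    """Cut size of `subset`, computed as the Laplacian quadratic form."""
    indicator = np.zeros(adjacency.shape[0])
    indicator[list(subset)] = 1.0
    laplacian = combinatorial_laplacian(adjacency)
    return float(indicator @ laplacian @ indicator)


# Path graph 0 - 1 - 2 - 3 (illustrative example).
path = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    dtype=float,
)

cut_01 = cut_size_via_laplacian(path, {0, 1})   # one edge (1, 2) is cut
cut_02 = cut_size_via_laplacian(path, {0, 2})   # edges (0,1), (1,2), (2,3) are cut
spectrum = np.linalg.eigvalsh(combinatorial_laplacian(path))
```

Because low-cut subsets correspond to indicator vectors with a small quadratic form, constraining xᵀLx is a natural continuous surrogate for a combinatorial constraint on cut size.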
Detecting Activations over Graphs using Spanning Tree Wavelet Bases
We consider the detection of activations over graphs under Gaussian noise,
where signals are piece-wise constant over the graph. Despite the wide
applicability of such a detection algorithm, there has been little success in
the development of computationally feasible methods with provable theoretical
guarantees for general graph topologies. We cast this as a hypothesis testing
problem, and first provide a universal necessary condition for asymptotic
distinguishability of the null and alternative hypotheses. We then introduce
the spanning tree wavelet basis over graphs, a localized basis that reflects
the topology of the graph, and prove that for any spanning tree, this approach
can distinguish null from alternative in a low signal-to-noise regime. Lastly,
we improve on this result and show that using the uniform spanning tree in the
basis construction yields a randomized test with stronger theoretical
guarantees that in many cases matches our necessary conditions. Specifically,
we obtain near-optimal performance in edge transitive graphs, k-nearest
neighbor graphs, and ε-graphs.
Estimating Graphlet Statistics via Lifting
Exploratory analysis over network data is often limited by the ability to
efficiently calculate graph statistics, which can provide a model-free
understanding of the macroscopic properties of a network. We introduce a
framework for estimating the graphlet count, that is, the number of occurrences
of a small subgraph motif (e.g., a wedge or a triangle) in the network. For massive
graphs, where accessing the whole graph is not possible, the only viable
algorithms are those that make a limited number of vertex neighborhood queries.
We introduce a Monte Carlo sampling technique for graphlet counts, called
Lifting, which can simultaneously sample all graphlets of size up to k
vertices for arbitrary k. This is the first graphlet sampling method that can
provably sample every graphlet with positive probability and can sample
graphlets of arbitrary size k. We outline variants of lifted graphlet counts,
including the ordered, unordered, and shotgun estimators, random walk starts,
and parallel vertex starts. We prove that our graphlet count updates are
unbiased for the true graphlet count and have a controlled variance for all
graphlets. We compare the experimental performance of lifted graphlet counts to
the state-of-the-art graphlet sampling procedures: Waddling and the pairwise
subgraph random walk.
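To make the quantity being estimated concrete, the sketch below computes exact wedge and triangle counts by brute force on a tiny graph. It is a baseline that needs access to the whole graph, not the Lifting sampler from the abstract. It assumes that graphlets are counted as induced subgraphs on three vertices (a wedge has exactly two of the three possible edges, a triangle has all three); the function name is hypothetical.

```python
from itertools import combinations


def count_wedges_and_triangles(num_vertices, edges):
    """Exact induced 3-vertex graphlet counts by enumerating all vertex triples."""
    edge_set = {frozenset(edge) for edge in edges}
    wedges = 0
    triangles = 0
    for triple in combinations(range(num_vertices), 3):
        # Number of edges present among the three vertex pairs of the triple.
        present = sum(
            frozenset(pair) in edge_set for pair in combinations(triple, 2)
        )
        if present == 2:
            wedges += 1
        elif present == 3:
            triangles += 1
    return wedges, triangles


# Triangle on vertices 0, 1, 2 with a pendant edge from 2 to 3.
example_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
wedge_count, triangle_count = count_wedges_and_triangles(4, example_edges)
```

Enumerating all triples costs on the order of n³ operations, which is why sampling methods based on a limited number of neighborhood queries are needed for massive graphs.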
Fused Density Estimation: Theory and Methods
In this paper we introduce a method for nonparametric density estimation on
geometric networks. We define fused density estimators as solutions to a total
variation regularized maximum-likelihood density estimation problem. We provide
theoretical support for fused density estimation by proving that the squared
Hellinger rate of convergence for the estimator achieves the minimax bound over
univariate densities of log-bounded variation. We reduce the original
variational formulation in order to transform it into a tractable,
finite-dimensional quadratic program. Because random variables on geometric
networks are simple generalizations of the univariate case, this method also
provides a useful tool for univariate density estimation. Lastly, we apply this
method and assess its performance on examples in the univariate and geometric
network setting. We compare the performance of different optimization
techniques to solve the problem, and use these results to inform
recommendations for the computation of fused density estimators.
Variance function estimation in high-dimensions
We consider the high-dimensional heteroscedastic regression model, where the
mean and the log variance are modeled as a linear combination of input
variables. Existing literature on high-dimensional linear regression models
has largely ignored non-constant error variances, even though they commonly
occur in a variety of applications ranging from biostatistics to finance. In
this paper we study a class of non-convex penalized pseudolikelihood estimators
for both the mean and variance parameters. We show that the Heteroscedastic
Iterative Penalized Pseudolikelihood Optimizer (HIPPO) achieves the oracle
property, that is, we prove that the rates of convergence are the same as if
the true model were known. We demonstrate numerical properties of the procedure
in a simulation study and on real-world data.
Comment: Appearing in Proceedings of the 29th International Conference on
Machine Learning, Edinburgh, Scotland, UK, 201
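The heteroscedastic regression model described in this abstract, with both the mean and the log variance linear in the inputs, can be simulated with a short sketch like the one below. The symbols `beta` and `theta`, the Gaussian design and noise, and the function name are assumptions chosen for illustration; the HIPPO estimator itself is not implemented here.

```python
import numpy as np


def simulate_heteroscedastic(n, beta, theta, seed=0):
    """Draw (X, y, sigma) with mean X @ beta and log variance X @ theta.

    Assumed setup: standard normal covariates and standard normal noise.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, len(beta)))
    mean = X @ beta
    log_variance = X @ theta
    # Standard deviation is exp(log_variance / 2).
    sigma = np.exp(0.5 * log_variance)
    y = mean + sigma * rng.standard_normal(n)
    return X, y, sigma


beta = np.array([1.0, 0.0, -2.0])
theta = np.array([0.5, 0.0, 0.0])
X, y, sigma = simulate_heteroscedastic(200, beta, theta)
```

Setting `theta` to the zero vector recovers the usual constant-variance linear model, which is a quick way to check an implementation against a homoscedastic baseline.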